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Abstract 

We address the problem of filling missing entries in a kernel Gram matrix, given a related full Gram ma- 
trix. We attack this problem from the viewpoint of regression, assuming that the two kernel matrices can be 
considered as explanatory variables and response variables, respectively. We propose a variant of the regression 
model based on the underlying features in the reproducing kernel Hilbert space by modifying the idea of kernel 
canonical correlation analysis, and we estimate the missing entries by fitting this model to the existing samples. 
We obtain promising experimental results on gene network inference and protein 3D structure prediction from 
genomic datasets. We also discuss the relationship with the em-algorithm based on information geometry. 

1 Introduction 

Kernel methods such as support vector machines (SVM) enable the use of powerful statistical analysis for various 
datasets as soon as kernel matrices for the dataset are available [1]. When a dataset contains N objects, the 
objects are represented a.s an N x N positive semidefinite matrix whose elements can be thought of as objects- 
object similarities. An advantage of kernel methods is that they can be applied not only to real-valued data but 
also to complex structured objects such as strings, trees and graphs [2]. The kernel matrix not only plays an 
important role as the input to kernel methods, but also provides important information regarding the similarity 
between objects. 

In this paper we consider the problem of estimating missing entries in a kernel matrix. More precisely we 
assume that two datasets describing the same objects are available; however, although all data are available for the 
first dataset, only part of the second dataset is available. In this case, a full kernel Gram matrix can be obtained 
for the first dataset, while only a partial kernel Gram matrix is obtained for the second dataset, that is, a matrix 
with missing entries. If we are more interested in the second dataset rather than in the first dataset, it is natural to 
think of estimating the missing part of the second kernel matrix, e.g., by looking for correlations between the two 
kernels. 

This problem arises commonly in applications such as bioinformatics, where informative data for a given 
classification task is expensive to produce while less informative data are easily available. As an example, the 
DNA sequences of all proteins of a given organism are easily obtained from the sequenced genome, while the 3D 
structures of most proteins are still unknown and difficult to obtain. In this case, we want to know the 3D structure 
information for the proteins whose structure have not been determined. As another example, high-throughput 
genome- wide data (e.g., gene expression data) are available for the full genes of an organism, while the metabolic 
network information is known only for a limited number of genes. In this case, we want to predict the unknown 
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part of gene network using the fully available genome-wide data. A variety of methods have been proposed in 
order to tackle such problems related to kernel matrix completion. Examples include the use of kernel canonical 
correlation analysis (kernel CCA) for gene classification [3] and gene network inference [4], and the use of the 
em-algorithm based on information geometry for protein 3D structure prediction [5]. 

In this paper we attack the kernel matrix completion problem from the viewpoint of regression, considering 
the two kernel matrices represent inner products between explanatory and response variables, respectively. We 
propose a variant of the regression model based on the underlying features in the reproducing kernel Hilbert space 
by modifying the idea of kernel CCA, and we estimate the missing entries by fitting this model to the existing 
samples. In the experiment, we show promising results on the prediction of missing edges in a gene network 
from genome- wide data and on the prediction of protein 3D structure information from their sequences. We also 
discuss the mathematical and numerical relationships between the proposed methods and the em-algorithm based 
on information geometry. 

2 Formalism of the question 

Suppose that we have an explanatory random variable x G R'^ and a response random variable y € R^. Let us 
now consider the situation where the data is available for all the objects for the explanatory variable x, while the 
data is available for the first n objects and unavailable for the remaining (N — n) objects for the response variable 
y. We refer to the first n objects as training set, and we refer to the remaining N — n objects as test set below. 

Let k and g be symmetric positive definite kernels defined on R'^ and R', respectively. When we compute the 
kernel matrix for the explanatory variable x, we obtain an x kernel matrix K, where {K)ij = /i;(xj, Xj) (1 < 
hj < N),-Ki belongs to a set X and is the number of all objects. On the other hand, when we compute the kernel 
matrix for the response variable y, we obtain an A^ x A^ kernel matrix G, where {G)ij = g{yi , yj) (1 < i, j < n), 
Yi belongs to a set y, and n is the number of available objects (n < A^). Note that G contains in fact missing 
values for all entries {G)ij with max(i, j) > n. We want to estimate the missing part of G using full Gram matrix 
K, taking into account a form of correlation between the two kernels. 

In this study we express each kernel matrix by splitting the matrix into four parts. We denote by Ku (resp. 
Gtt) the n X n kernel matrix for the training set versus itself, Kpt (resp. Gpt) the (A^ — n) x n kernel matrix for 
the test set versus the training set, and Kpp (resp. Gpp) the (A^ — n)x (N — n) kernel matrix for the test set versus 



Note that Kpt and Kpp are known, while Gpt and Gpp are unknown. The goal is to predict Gpt and Gpp from K 
and Gtt- 



In this section we describe four approaches that can be used for the problem of kernel matrix completion: the 
direct approach (Section 13.11 ) is a baseline straightforward approach, the kernel CCA approach (Section 13.21 ) has 
been proposed in previous work [4], while the kernel matrix regression (Section [33] ) and penalized kernel matrix 
regression (Section [I4b are new. 

3.1 Direct approach 

A straightforward approach is to directly plug the entries of the kernel matrix K for the explanatory variable x into 
the missing entries of the kernel matrix G for the response variable y, that is, to choose Gpt = Kpt and Gpp = Kpp. 
We refer to this approach as the direct approach. 



itself: 
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3.2 Kernel CCA (kCCA) 



The use of kernel canonical correlation analysis (kCCA) has been proposed to estimate the unknown part of the 
metaboUc network form genomic data [4]. We make a brief review of this approach in this subsection. 

This approach amounts to searching low-dimensional feature spaces derived from both kernels that are maxi- 
mally correlated during the training phase. The reconstruction of missing entries in G is then obtained by projecting 
the corresponding points onto the feature space for the kernel K, and computing their inner product in this feature 
space as an approximation of the kernel G. More formally, let us write a feature u : R*^ R for the explanatory 
variable x, and a feature -y : R' — > R for the response variable y in the reproducing kernel Hilbert spaces as 
follows: 

n Tt 

and use the notation oc = (ai, a2, • • • , c(n)~^ and (3 = (Pi,P2, - ■ ■ , Pn)'^ ■ The objective is to find features u and v 
that are as correlated as possible, that is, which maximize the following correlation 

„ , , Covin, v) 

p = Corriu, V) = y^^^^y„y^^^^y/,, (3) 

where Cov{u, v) = E[uv) — E{u)E{y) and Var{u) = E{u^) — E{u)'^. For theoretical and practical reasons [6] 
it is better to compute features u and v which maximize the following penalized canonical correlation: 

a'^KuGttP 



where I is an identity matrix, and and Aj, are positive regularization parameters, and the matrices Ku and 
Gtt are assumed to be centered. When m different features u^^^ and v^^^ associated the m largest canonical 
correlations p^^\j = 1, 2, • • • , m are obtained, they can be merged into feature vectors u(x) and v(y) as u(x) = 
('^(■^^(x), • ■ ■ ,u^'^\x))~^ and v(y) = (^^''"•'(y), • ■ ■ ,v^'^\y))'^ . The missing entries in G are then estimated as 
^(y,y') = u(x)"ru(x'). 



3.3 Kernel matrix regression (KMR) 

An apparent drawback of the kCCA approach is that the objective function of kCCA is different from that of 
correctly predicting the values of the kernel G. In particular, by computing features v for the response variable y, 
the notion of similarity between response variable y is changed. In the problem of kernel matrix completion, we 
do not want to change the similarity space for the response variable y. We want instead to change the object-object 
similarity space only for the explanatory variable x to make it fit the the object-object similarity space for the 
response variable y. In this section we propose a variant of the regression model based on the underlying features 
in the reproducing kernel Hilbert space by modifying the idea of kernel CCA. 

The ordinary regression model between an explanatory variable x G R*^ and a response variable y G R can be 
formulated as follows: 

y = /(x) + e, (5) 

where / : R'^ R and e is a noise term. By analogy we propose to regard (x, x') G R*^ x R'^ as an explanatory 
variable and g{y, y') G R as a response variable in our context. Assuming the underlying feature u(x) G R"* in 
the reproducing kernel Hilbert space, we formulate a variant of the regression model as follows: 

g{y, y) = /(x, x') + e = u(x)^u(x') + e, (6) 

where / : R'^ x R'^ — > R. We refer to this model as kernel matrix regression model. We note that imposing / 
to be of the form /(x, x') = u(x)^u(x') for some feature u : R'^ —>■ R™ ensures that the regression function is 
positive definite and the number of dimension m of the feature u is allowed to be infinite. 
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Following a classical approach in kernel methods, we consider features in the reproducing kernel Hilbert space 
of the kernel K that possess an expansion of the form: 

n 

=^K^^^j)wj, (7) 

where w = {wi,W2, ■ ■ ■ , Wn)'^ is a weight vector and n is the number of objects in the training set. When m dif- 
ferent features are considered, we express them by a feature vector u as u(x) = (x), u^^^ (x), • • • , u^™) (x))"''. 

In order to represent the set of features for all the objects, we define feature score matrices Ut (x) = [u(xi ) , u(x2 ) , 
for the training set and t/p(x) = [u(x„_|_i), u(x„+2), ■ • • , u(xiv)]^ for the test set. 

In the matrix form, we can actually compute the feature score matrices as Ut = K^W for the training set and 
Up = KptW for the test set, where W = [w(^), w^^), . . . , w^™)]. 

The inner products of the feature vectors between two objects are g(x, x') = u(x)^u(x'). To represent all the 
object-object similarities in the feature space, we define the similarity matrix Q as 

{Q)ij = = u(xi)^u(xj), I <i,j < N. 

Splitting the matrix Q into several parts according to the training set, test set and their interaction, we can compute 
them as follows: 



Training set versus Training set: 

Test set versus Training set: 
Test set versus Test set: 



Qtt = UtUj = KttWW^Kl , (8) 
Qpt = UpUj = KptWW^Kl , (9) 



Qpp = UpUj = KptWW^Kjt . (10) 

Here we want to find the n x m weight matrix W such that Qtt fits Gtt as much as possible. If we set 
A = WW^, this problem can be replaced by finding A which minimizes the difference between Gtt and Qtt- 
It means that, this enables us to avoid considerable computational burden for computing W itself, even if m is 
infinite. Therefore, we attempt to find A{= WW~^) which minimizes 

L =\\ Gtt - KttAKjt (11) 

where || • indicates the Frobenius norm. We can rewrite the above equation in the trace form as 

L = tr { {Gtt - KttAKl){Gtt - KttAKj^f } . (12) 

The derivative of L with respect to A is obtained as 

From setting = 0, the solution is analytically obtained by 

A = WW^ = Krt'GttKf,\ 
Then, the feature score matrix U can be computed for training set and test set, respectively, as follows: 

Ut = KttKi,^Gl{\ Up = KptKi,^Gl{\ (13) 

Therefore, we can compute the feature-based similarity matrix Q involving the test set as follows: 
Test set versus Training set: 

Qpt = UpUj = KptKit^Gtt, (14) 

Test set versus Test set: 

Qpp = UpUj = KptKi,'GttKi,'Kj,. (15) 

By using the Qpt and Qpp, we can predict the missing entries in the kernel matrix G, which correspond to Gpt and 
Gpp. 
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3.4 Penalized kernel matrix regression (PKMR) 

Here we consider introducing the idea of regularization in the KMR proposed in the previous section. To do so, 
we attempt to find A{= WW~^) which minimizes the following penalized loss function: 

L =11 Gtt - KuAKtt \\l +\PEN{A), (16) 

where A is a regularization parameter and PEN{A) is a penalty term for A defined as follows. Each positive 
semidefinite matrix A can be expanded as A = Y^^=i ^i^J ^ where the (vifj)j=i ... „ form an orthogonal basis of 
eigenvectors. To each Wj is associated a feature : R"' ^ R by (|7]), whose norm in the RKHS of K is given by: 

n 

II \\'rkhs= X! '^i,j^i,kK'^j,^k) =tr(wiw7-fC) . 

j,k=l 

To enforce regularity of the global mapping u, we therefore define the following penalty for A: 

n n 

PEN {A) = 2^ II ^z, 111^^5= 2^ tr(w,wTK,i) = 2tr(^K«) . 

i=l i=l 

In this case, the optimization problem is reduced to finding A which minimizes 

L = ti {{Gtt - KttAKl){Gtt - KttAKl)^] + 2Atr {AKtt} . (17) 
The derivative of L with respect to A is 

2 9l = -KttGttKtt + KlAKl + AK«. 
Therefore, the solution of the above penalized optimization problem is obtained by 

A = K^^\Gtt-\Kt^)K^^\ 

We note that the justification for the penalty used is only valid for positive semidefinite matrices, which will be 
obtained at least for small enough A. Therefore, we can compute the feature-based similarity matrix Q involving 
the test set as follows: 

Test set versus Training set: 

Qpt = UpUj = K^tKuHGtt - XKu'), (18) 

Test set versus Test set: 

Qpp = UpUj = K,tK^,HGtt - XKi,')Ki,'K;,. (19) 

By using the Qpt and Qpp, we can predict the missing entries in the kernel matrix G, which correspond to Gpt and 
Gpp. 

4 Relationship with the em-algorithm 

For the kernel matrix completion problem, the use of the em algorithm based on information geometry has been 
proposed [7]. There the kernel matrix completion problem is defined as finding missing entries that minimize the 
Kullback-Leibler divergence between the resulting completed matrix and a spectral variant of the full matrix. 

It is interesting to observe that the final algorithms between em and KMR are very similar. The em algorithm 
results in the following equations for estimating the incomplete parts Gpt and Gpp in G: 
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Test set versus Training set: 

Qpt = KptKft^Gtt, (20) 

Test set versus Test set: 

Qpp = Kpp + KptKr,'Kj, + KptKi,'GuKi,'Kj,. (21) 

We note that the Qpt of the em-algorithm is equivalent to that of the kernel matrix regression. On the other 
hand, the Qpp of the em-algorithm is not equivalent to that of the kernel matrix regression. It differs by Kpp + 
KptK^^ Kjf. This stems from the difference of the geometry space between the two methods. The em-algorithm 
is based on the information geometry, while the proposed KMR is based on the Euclidean geometry. 

5 Experiments 

In this section we report an empirical comparison of different methods: 1) direct method, 2) kCCA method, 3) 
em-method, 4) KMR method, 5) PKMR method applied to the problem of gene network inference and protein 3D 
structure prediction. 

5.1 Estimation of missing edges in the metabolic gene network 

The metabolic gene network is an important biological network. However, most parts of the metabolic gene 
network remain unknown, and many enzyme genes are still missing in our current knowledge. Determining new 
enzyme genes and their position in the metabolic network is an expensive and painstaking process that requires 
many wet experiments. On the contrary, we can easily obtain various genome-wide genomic datasets representing 
gene/protein information, such as gene expression data, protein localization data, and phylogenetic profiles. We 
therefore attempt to predict missing edges in the metabolic gene network by using such genomic data. 

We gathered a kernel matrix of the genomic data (consisting of 769 genes) by combining three kernel matrices 
obtained from three datasets: gene expression data, protein locahzation data, and phylogenetic profiles, and we 
regard this matrix as an explanatory kernel matrix K. We used the same datasets and corresponding kernels as 
those used by [4]. We obtained a kernel matrix for the gene network from the graph information of the gene 
network by using the diffusion kernel [8] with parameter (7 = 1, and we regard this matrix as a response kernel 
G. The kernel matrix K is invertible in this case. The kernel similarity values in G (transformed by the diffusion 
kernel from the graph of the gene network) are expected to represent the intensity of graphical association between 
genes, which can be considered as a possibihty of the existence of the edge. Therefore, if the gene pairs sharing 
similarities higher than a threshold, they are predicted to interact with each other. 

To compare the performance between different methods, we applied the direct approach, the kCCA, the em- 
algorithm, the KMR, and the penalized KMR (PKMR) to the gene network prediction. We tested their performance 
by cross-validation. In each cross-validation iteration, we randomly split the genes in the gold standard data into 
training set and test set. We learned the model based on the training dataset only and we applied the model to the 
test set in order to predict the missing edges involving the test set on the metabolic network. We are also interested 
in the effect of the rate of the test samples against the training samples, so we carried out the same experiment with 
different percentages of the test samples in the data splitting process in each cross-vahdation iteration. 

As a measure of the performance, we used the AUC score (area under the ROC curve), because the performance 
depends on the threshold given in advance. The ROC curve is defined as a function of the true positive rates against 
the false positive rates based on several threshold values. "True positive" means that the predicted gene-pairs are 
actually present in the gold standard network, while "false positive" means that the predicted gene-pairs are absent 
in the gold standard network. In the case of the kCCA, we set the regularization parameters A^; and \y as 0. 1 and 
0.1, and we used 30 features, as suggested by [4]. In the case of the PKMR, the regularization parameter A is 
optimized by applying the internal cross-validation within the training set with the AUC score as a target criterion, 
which provides us with A = 0.1. 



6 



Table 1: Comparison of AUC scores with varying rates of training set: Qtp (training versus test). 



Rate 


DIRECT 


KCCA 


EM 


KMR 


n XT' TV 4" Tl 

FKMK 


90 % 


0.598 


0.840 


0.889 


0.889 


0.892 


ou /o 


U. J / u 


U.OZM- 








70 % 


0.580 


0.783 


0.805 


0.805 


0.814 


60 % 


0.575 


0.780 


0.786 


0.786 


0.801 


50 % 


0.579 


0.772 


0.783 


0.783 


0.772 


40 % 


0.569 


0.714 


0.760 


0.760 


0.758 


30 % 


0.571 


0.682 


0.732 


0.732 


0.738 


20 % 


0.565 


0.633 


0.674 


0.674 


0.676 


10 % 


0.593 


0.669 


0.672 


0.672 


0.676 



Table 2: Comparison of AUC scores with varying rates of training set: Qpp (test versus test). 



Rate 


direct 


kCCA 


EM 


KMR 


PKMR 


90 % 


0.531 


0.785 


0.766 


0.774 


0.787 


80 % 


0.593 


0.727 


0.724 


0.723 


0.743 


70 % 


0.602 


0.680 


0.686 


0.700 


0.703 


60 % 


0.558 


0.673 


0.683 


0.678 


0.675 


50 % 


0.581 


0.661 


0.644 


0.651 


0.662 


40 % 


0.569 


0.646 


0.635 


0.635 


0.642 


30 % 


0.583 


0.610 


0.621 


0.627 


0.637 


20 % 


0.579 


0.587 


0.587 


0.567 


0.576 


10 % 


0.568 


0.591 


0.585 


0.573 


0.589 



All the result of the experiments are summarized in Tables 1 and 2. Table 1 shows the result in estimating 
Qpt, while Table 2 shows the result in estimating Qpp. In each table, the rows correspond to the percentage of the 
training samples against the test samples and the columns correspond to the methods. It appears that the direct 
approach performs significantly worse than the other supervised learning based methods. It seems that, the kCCA 
performs worse than the em-algorithm, the KMR and PKMR in estimating Qpt, while the kCCA performs better 
than the em-algorithm and the KMR, and at competitive level with the PKMR in estimating Qpp. Focusing on the 
comparison of the performance between the em-algorithm and the KMR, both the em-algorithm and the KMR show 
the same performance in estimating Qpt, as expected from the mathematical relationship between the em-algorithm 
and the KMR. On the other hand, the em-algorithm and the KMR behave differently in estimating Qpp, but their 
performances are at competitive level. The penalized KMR (PKMR) slightly outperforms the other methods in 
estimating both Qpt and Qpp, suggesting that the introduction of regularization can be meaningful in this context. 

5.2 Prediction of protein 3D structures from their sequences 

Protein 3D structures are strongly associated with evolutionary history and biological functions, compared with 
protein sequences. Here we attempt to classify proteins into superfamilies based on the structure information. 
However, the number of proteins whose structures are determined is limited even nowadays and the structure 
information of most proteins is almost unknown. Therefore, we performed protein classification by predicting 
missing similarity elements of protein 3D structures. 

The sequence similarities are obtained by marginalized kernel [9] and the 3D structural similarities are obtained 
by the result of MATRAS [10]. We used the same datasets and corresponding kernel matrices used in previous 
work [5], where K corresponds to the similarity matrix for protein sequences and G corresponds to the similarity 
matrix for protein 3D structures. The kernel matrix K is invertible in this case. We applied the support vector 
machine (SVM) to the dataset of TIM beta/alpha-barrel protein fold (18 classes, 90 proteins), and conducted 
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one-versus-other supervised classifications for class 1 and class 3, respectively. The reason why we selected the 
above two classes is that they have more than 10 members in each class. The leave-one-out cross-validation is 
conducted and the performance is evaluated by using sensitivity and specificity, where sensitivity is defined as 
#TP/{#FN + #TP) and specificity is defined as i^TP/{i^FP + #TP), respectively In the case of PKMR, 
the regularization parameter A is optimized by applying the internal cross-vaUdation within the training set, and 
set to be 0.1. 



Table 3: Comparison of sensitivities (left) and specificities (right): Class 1. 



Rate 


direct 


kCCA 


em 


KMR 


PKMR Rate 


direct 


kCCA 


em 


KMR 


PKMR 


0% 


0.47 








- 0% 


0.66 










20% 




0.82 


0.76 


0.76 


0.88 20% 




0.45 


0.43 


0.41 


0.42 


40% 




0.70 


0.70 


0.70 


0.76 40% 




0.36 


0.4 


0.36 


0.38 


60% 




0.70 


0.88 


0.88 


0.94 60% 




0.5 


0.53 


0.48 


0.51 


80% 




0.76 


0.88 


0.88 


0.88 80% 




0.81 


0.93 


0.93 


0.93 


100% 


0.94 


0.94 


0.94 


0.94 


0.94 100% 


1.0 


1.0 


1.0 


1.0 


1.0 



Table 4: Comparison of sensitivities (left) and specificities (right): Class 3. 



Rate 


direct 


kCCA 


em 


KMR 


PKMR Rate 


direct 


kCCA 


em 


KMR 


PKMR 


0% 


0.58 








- 0% 


0.62 










20% 




0.64 


0.76 


0.76 


0.82 20% 




0.40 


0.44 


0.34 


0.35 


40% 




0.64 


0.82 


0.88 


0.88 40% 




0.39 


0.53 


0.45 


0.42 


60% 




0.76 


0.82 


0.82 


0.88 60% 




0.56 


0.51 


0.51 


0.53 


80% 




0.76 


0.94 


0.94 


1.0 80% 




0.68 


0.94 


0.94 


0.85 


100% 


1.0 


1.0 


1.0 


1.0 


1.0 100% 


0.94 


0.94 


0.94 


0.94 


0.94 



Table 3 and Table 4 show the results of computing sensitivities and specificities depending on the rate of 
training set for class 1 and class 3, respectively, in the leave-one-out cross-validation experiments. The direct 
method with 0% means that we use sequence information only, while the direct method with 100% means that we 
use structure information only. Looking at the tables, the PKMR seems to outperform the other methods in this 
context especially from the viewpoint of sensitivity. In contrast, the performance seems to be at competitive level 
across different methods from the viewpoint of specificity. 

6 Discussions and conclusions 

In this paper we addressed the problem of filUng missing entries in a kernel Gram matrix, given a related full 
kernel Gram matrix. We attacked this kernel matrix completion problem from the viewpoint of regression. We 
proposed the kernel matrix regression (KMR) based on the underlying features in the reproducing kernel Hilbert 
space by modifying the idea of kCCA. Through the development of the KMR, we also clarified the mathematical 
relationship between the kCCA, the em-algorithm, and the proposed methods in the context of the kernel matrix 
completion problem. In the experiment on gene network inference and protein 3D prediction, we confirmed that 
the performance of the KMR is competitive with that of other methods, and we showed that the penahzed version 
of the KMR works the best when an appropriate regularization parameter is chosen. 
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